As can be seen in 20_04_08_Ridge.html, there is a relation between the score the model assigns to a gene and the gene's mean level of expression. This is a problem because we had previously discovered a bias in the SFARI scores related to mean level of expression (Preprocessing/Gandal/AllRegions/RMarkdowns/20_04_03_SFARI_genes.html), which means mean expression could be a confounding factor in our model and the reason it seems to perform well. We therefore need to remove this bias to recover the true biological signal mixed with it and improve the quality of our model.
Train the model with equal weights for all samples
For each iteration of the loop:
    Calculate the bias
    Correct the weights to reduce the bias
    Retrain the model
Return the last model
Parameters:
eta: Learning rate
T: Number of loops
D: Training data
H: Classification model
c: bias constraint
lambda: scaling factor for the weights
\(w_i\) with \(i=1,...,N\): Weights assigned to each sample
Pseudocode:
lambda = 0
w = [1, ..., 1]
c = std(meanExpr(D))
h = train classifier H with lambda and w
for t in 1,...,T do
    bias = <h(x), c(x)>
    lambda = lambda - eta*bias
    w_hat_i = exp(lambda*c(x_i))
    w_i = w_hat_i/(1+w_hat_i) if y_i=1, 1/(1+w_hat_i) if y_i=0
    retrain h with the new weights
return h
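As a minimal runnable sketch of this loop, here is a toy version on simulated data, using plain logistic regression via `glm` instead of the ridge model trained below; the simulated variables (`meanExpr_std`, `x`, `y`) and the number of loops are illustrative, not part of the analysis data.

```r
# Toy reweighting loop: the label is biased towards genes with high
# (standardised) mean expression, and each iteration reweights the samples
# until the positive predictions are balanced around the mean expression.
set.seed(123)
N = 500
meanExpr_std = rnorm(N)                       # c(x): standardised mean expression
y = rbinom(N, 1, plogis(meanExpr_std))        # labels biased towards high expression
x = y + rnorm(N)                              # one informative feature

eta = 0.5 ; Loops = 20 ; lambda = 0
w = rep(1, N)
for (t in 1:Loops) {
  # suppressWarnings: glm warns about non-integer weights with family=binomial
  h = suppressWarnings(glm(y ~ x, family = binomial, weights = w))
  positives = predict(h, type = 'response') > 0.5
  bias = mean(meanExpr_std[positives])        # <h(x), c(x)> over positive predictions
  lambda = lambda - eta * bias
  w_hat = exp(lambda * meanExpr_std)
  w = ifelse(y == 1, w_hat / (1 + w_hat), 1 / (1 + w_hat))
}
```

By the last iterations the bias should oscillate around zero, mirroring the behaviour of the full model further down.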
library(tidyverse)
library(knitr)
library(plotly) ; library(viridis) ; library(gridExtra) ; library(RColorBrewer) ; library(corrplot) ; library(reshape2)
library(biomaRt)
library(Rtsne)
library(caret) ; library(ROCR) ; library(car)
library(polycor)
library(expss)
SFARI_colour_hue = function(r) {
  c('#FF7631','#FFB100','#E8E328','#8CC83F','#62CCA6','#59B9C9','#b3b3b3','#808080','gray','#d9d9d9')[r]
}
# Clusterings
clustering_selected = 'DynamicHybridMergedSmall'
clusterings = read_csv('./../Data/clusters.csv')
clusterings$Module = clusterings[,clustering_selected] %>% data.frame %>% unlist %>% unname
assigned_module = data.frame('ID' = clusterings$ID, 'Module' = clusterings$Module)
# Original dataset
original_dataset = read.csv(paste0('./../Data/dataset_', clustering_selected, '.csv'), row.names=1)
# Model dataset
load('./../Data/LR_model.RData')
# Regression data
load(file='./../Data/Ridge_model.RData')
# Mean Expression data
load('./../Data/preprocessed_data.RData')
datExpr = datExpr %>% data.frame
DE_info = DE_info %>% data.frame
# Dataset created with DynamicTreeMerged algorithm
clustering_selected = 'DynamicHybridMergedSmall'
original_dataset = read.csv(paste0('./../Data/dataset_', clustering_selected, '.csv'), row.names=1)
# Add gene symbol
getinfo = c('ensembl_gene_id','external_gene_id')
mart = useMart(biomart='ENSEMBL_MART_ENSEMBL', dataset='hsapiens_gene_ensembl',
host='feb2014.archive.ensembl.org')
gene_names = getBM(attributes=getinfo, filters=c('ensembl_gene_id'), values=rownames(dataset), mart=mart)
rm(dds, datGenes, datMeta, clustering_selected, clusterings, mart, getinfo)
Using Demographic Parity as a measure of bias: A fair classifier h should make positive predictions on each segment \(G\) of the population at the same rate as on the population as a whole
This definition is for discrete segments of the population. Since our bias is present across the whole population, but to a different degree depending on the mean level of expression of the gene, we have to adapt this definition to a continuous bias scenario
Demographic Parity for our problem: A fair classifier h should make positive predictions on genes with a certain mean level of expression at the same rate as on all of the genes in the dataset
The original formula for the Demographic Parity bias is
\(c(x,0) = 0\) when the prediction is negative
\(c(x,1) = \frac{g(x)}{Z_G}-1\) when the prediction is positive. Where \(g(x)\) is the Kronecker delta to indicate if the sample belongs to the protected group and \(Z_G\) is the proportion of the population that belongs to the group we want to protect against bias
Using these definitions in our problem:
\(g(x):\) Since all our samples belong to the protected group, this would always be 1
\(Z_G:\) Since all of our samples belong to the protected group, this would also always be 1
So our measure of bias \(c(x,1) = \frac{1}{1}-1 = 0\) for all samples. This doesn't work, so we need to adapt it to our continuous case
We can use \(c(x,1) = std(meanExpr(x))\), the standardised mean expression of the gene, as the constraint function. This way, when we calculate the bias of the dataset:
\(h(x)\cdot c(x)\) will only be zero if the positive samples are balanced around the mean expression, and its sign will indicate the direction of the bias
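A small helper makes this constraint concrete; the function name and example values are illustrative, not part of the analysis code below.

```r
# Continuous demographic-parity bias: the mean standardised expression of the
# positively-predicted genes. It is zero when the positives are balanced around
# the overall mean expression, and its sign gives the direction of the bias.
demographic_parity_bias = function(positive_preds, meanExpr) {
  meanExpr_std = (meanExpr - mean(meanExpr)) / sd(meanExpr)  # c(x)
  mean(meanExpr_std[positive_preds])                         # <h(x), c(x)>
}

meanExpr = c(1, 2, 3, 10, 11, 12)
# Positives concentrated on highly-expressed genes -> positive bias
demographic_parity_bias(c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE), meanExpr)
# Positives on every gene -> bias is exactly zero
demographic_parity_bias(rep(TRUE, 6), meanExpr)  # 0
```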
train_set$SFARI = train_set$SFARI %>% as.factor
set.seed(123)
# Initial parameters
eta = 0.5
Loops = 50
lambda = 0
w = rep(1, nrow(train_set))
h = train(SFARI ~., data = train_set, method = 'glmnet', trControl = trainControl('cv', number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(1, -3, by = -.1)))
mean_expr = data.frame('ID' = rownames(datExpr), 'meanExpr' = rowMeans(datExpr)) %>%
filter(ID %in% rownames(train_set)) %>% right_join(data.frame('ID' = substr(rownames(train_set),1,15)), by = 'ID') %>%
#mutate('meanExpr_std' = (meanExpr-min(meanExpr))/(max(meanExpr)-min(meanExpr)))
mutate('meanExpr_std' = (meanExpr-mean(meanExpr))/sd(meanExpr))
# Track behaviour of plot
bias_vec = c()
acc_vec = c()
for(l in 1:Loops){
# Calculate bias for positive predicted samples
bias = mean(mean_expr$meanExpr_std[predict(h,train_set) %>% as.logical])
# bias = hetcor(predict(h,train_set) %>% as.numeric, mean_expr$meanExpr)$correlations[1,2]
# Update weights
lambda = lambda - eta*bias
w_hat = exp(lambda*mean_expr$meanExpr_std)
w = 1/(1+w_hat)
w[train_set$SFARI %>% as.logical] = w[train_set$SFARI %>% as.logical] * w_hat[train_set$SFARI %>% as.logical]
# Update tracking vars
bias_vec = c(bias_vec, bias)
acc_vec = c(acc_vec, mean(predict(h,train_set) == train_set$SFARI))
# Update h
h = train(SFARI ~., data = train_set, method = 'glmnet', weights = w, trControl = trainControl('cv', number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(1, -3, by = -.1)))
}
The bias decreases until it oscillates around zero, and the accuracy is not affected much
plot_info = data.frame('iter' = 1:length(bias_vec), 'bias' = bias_vec, 'accuracy' = acc_vec) %>% melt(id.vars = 'iter')
plot_info %>% ggplot(aes(x=iter, y=value, color = variable)) + geom_line() + theme_minimal()
rm(eta, Loops, l, w_hat, bias)
Since the bias increases the probability of being classified as 1 for genes with higher levels of expression, as a gene's level of expression increases the algorithm:
Increases the weight of genes with a negative label
Decreases the weight of genes with a positive label
w_hat = exp(lambda*mean_expr$meanExpr_std) # inverse to mean expr
w0 = 1/(1+w_hat) # proportional to mean expr
w = 1/(1+w_hat)
w[train_set$SFARI %>% as.logical] = w[train_set$SFARI %>% as.logical]*w_hat[train_set$SFARI %>% as.logical] # inverse to mean expr for positives, proportional for negatives
plot_data = data.frame(meanExpr = mean_expr$meanExpr, w_hat = w_hat, w0 = w0, w = w, SFARI = train_set$SFARI, pred = predict(h,train_set))
plot_data %>% ggplot(aes(meanExpr, w, color = SFARI)) + geom_point(alpha = 0.3) + ylab('weight') + xlab('Mean Expression') +
ggtitle('Weights of the final model') + ylim(c(0,1)) + theme_minimal()
The relation is not completely gone; there seems to be a negative relation for the genes with the lowest levels of expression.
Even though the trend line is not as flat as with the first method, we are no longer fixing this directly as we were doing before; the flattening is now just a consequence of the corrections made inside the model, so it makes sense for it to be less exact than before
# Correct Bias
predictions = h %>% predict(test_set, type='prob')
test_set$corrected_score = predictions$`TRUE`
test_set$corrected_pred = test_set$corrected_score>0.5
# Plot results
plot_data = data.frame('ID'=rownames(datExpr), 'meanExpr'=rowMeans(datExpr)) %>%
right_join(test_set %>% mutate(ID=rownames(test_set)), by='ID')
plot_data %>% ggplot(aes(meanExpr, corrected_score)) + geom_point(alpha=0.1, color='#0099cc') +
#geom_smooth(method='lm', color='#999999', se = FALSE, alpha=0.6) +
geom_smooth(method='gam', color='gray', alpha=0.2) + ylab('Corrected Score') + xlab('Mean Expression') +
theme_minimal() + ggtitle('Mean expression vs Model score corrected using adjusted weights')
conf_mat = test_set %>% apply_labels(SFARI = 'Actual Labels',
corrected_score = 'Corrected Score',
corrected_pred = 'Corrected Label Prediction')
cro(conf_mat$SFARI, list(conf_mat$corrected_pred, total()))
| Actual Labels | Corrected Label Prediction: FALSE | Corrected Label Prediction: TRUE | #Total |
|---|---|---|---|
| FALSE | 7195 | 5062 | 12257 |
| TRUE | 79 | 98 | 177 |
| #Total cases | 7274 | 5160 | 12434 |
rm(conf_mat)
The accuracy was expected to decrease because the bias was helping classify samples correctly, but for the wrong reasons
old_acc = mean(test_set$SFARI==(test_set$prob>0.5))
acc = mean(test_set$SFARI==(test_set$corrected_pred))
cat(paste0('Accuracy = ', round(acc,4)))
## Accuracy = 0.5865
cat(paste0('Accuracy decreased ',round(old_acc-acc,4), ' points'))
## Accuracy decreased 0.0671 points
rm(acc, old_acc)
pred_ROCR = prediction(test_set$corrected_score, test_set$SFARI)
roc_ROCR = performance(pred_ROCR, measure='tpr', x.measure='fpr')
AUC = performance(pred_ROCR, measure='auc')@y.values[[1]]
plot(roc_ROCR, main=paste0('ROC curve (AUC=',round(AUC,2),')'), col='#009999')
abline(a=0, b=1, col='#666666')
rm(roc_ROCR, AUC)
lift_ROCR = performance(pred_ROCR, measure='lift', x.measure='rpp')
plot(lift_ROCR, main='Lift curve', col='#86b300')
rm(lift_ROCR, pred_ROCR)
The distribution looks very similar to before; the means of the two groups are a bit closer together
plot_data = test_set %>% dplyr::select(corrected_score, SFARI)
ggplotly(plot_data %>% ggplot(aes(corrected_score, fill=SFARI, color=SFARI)) + geom_density(alpha=0.3) + xlab('Score') +
geom_vline(xintercept = mean(plot_data$corrected_score[plot_data$SFARI]), color = '#00C0C2', linetype = 'dashed') +
geom_vline(xintercept = mean(plot_data$corrected_score[!plot_data$SFARI]), color = '#FF7371', linetype = 'dashed') +
theme_minimal() + ggtitle('Model score distribution by SFARI Label'))
The positive relation between SFARI scores and Model scores is still there but is not as strong as before
plot_data = test_set %>% mutate(ID=rownames(test_set)) %>% dplyr::select(ID, corrected_score) %>%
left_join(original_dataset %>% mutate(ID=rownames(original_dataset)), by='ID') %>%
dplyr::select(ID, corrected_score, gene.score) %>% apply_labels(gene.score='SFARI Gene score')
cro(plot_data$gene.score)
| SFARI Gene score | #Total |
|---|---|
| 1 | 5 |
| 2 | 12 |
| 3 | 38 |
| 4 | 86 |
| 5 | 32 |
| 6 | 4 |
| None | 12257 |
| #Total cases | 12434 |
ggplotly(plot_data %>% ggplot(aes(gene.score, corrected_score, fill=gene.score)) + geom_boxplot() +
scale_fill_manual(values=SFARI_colour_hue(r=c(1:6,8,7))) +
ggtitle('Distribution of the Model scores by SFARI score') +
xlab('SFARI score') + ylab('Model score') + theme_minimal())
Print genes with the highest corrected scores in the test set
We lost all the genes with a score of 1, so they were probably at the top because of their high level of expression
Only one SFARI gene remains, with a score of 3. A proportion of 1 in 50 is still higher than the 1 in 67 of the dataset as a whole, so there is still a larger concentration of SFARI genes than in the general dataset, but this result is not as robust as before
Some genes with positive Gene Significance or positive Module-Diagnosis correlation are now in the top 50; before, they were all negative
test_set %>% dplyr::select(corrected_score, SFARI) %>% mutate(ID = rownames(test_set)) %>%
arrange(desc(corrected_score)) %>% top_n(50, wt=corrected_score) %>%
left_join(original_dataset %>% mutate(ID=rownames(original_dataset)), by='ID') %>%
left_join(gene_names, by = c('ID'='ensembl_gene_id')) %>%
dplyr::rename('GeneSymbol' = external_gene_id, 'Probability' = corrected_score,
'ModuleDiagnosis_corr' = MTcor, 'GeneSignificance' = GS) %>%
mutate(ModuleDiagnosis_corr = round(ModuleDiagnosis_corr,4), Probability = round(Probability,4),
GeneSignificance = round(GeneSignificance,4)) %>%
dplyr::select(GeneSymbol, GeneSignificance, ModuleDiagnosis_corr, Module, Probability, gene.score) %>%
kable(caption = 'Genes with highest model probabilities from the test set')
| GeneSymbol | GeneSignificance | ModuleDiagnosis_corr | Module | Probability | gene.score |
|---|---|---|---|---|---|
| SNX25 | 0.0951 | -0.0094 | #00A7FF | 0.7934 | None |
| SCN3B | 0.0668 | 0.1127 | #FF62BC | 0.7927 | None |
| HIVEP2 | 0.0165 | -0.9514 | #00C0AF | 0.7903 | None |
| MIDN | -0.0520 | -0.6031 | #00BA38 | 0.7835 | None |
| ARHGAP20 | -0.0230 | -0.9514 | #00C0AF | 0.7818 | None |
| MCOLN1 | -0.1546 | -0.9514 | #00C0AF | 0.7779 | None |
| CLMP | 0.0344 | -0.0094 | #00A7FF | 0.7748 | None |
| CELF2 | -0.0605 | -0.9514 | #00C0AF | 0.7743 | None |
| PLXNC1 | -0.0088 | -0.0094 | #00A7FF | 0.7713 | None |
| KCNJ6 | -0.1379 | -0.9514 | #00C0AF | 0.7701 | None |
| ADAMTSL1 | 0.0663 | -0.9514 | #00C0AF | 0.7673 | None |
| PACSIN3 | -0.0825 | -0.9514 | #00C0AF | 0.7671 | None |
| SIK2 | 0.1631 | 0.1127 | #FF62BC | 0.7647 | None |
| CSMD3 | 0.0768 | 0.2525 | #C99800 | 0.7620 | None |
| RALYL | 0.0077 | -0.9514 | #00C0AF | 0.7615 | None |
| KIAA1217 | -0.1868 | -0.9514 | #00C0AF | 0.7614 | None |
| NEK10 | -0.0444 | -0.9514 | #00C0AF | 0.7601 | None |
| TBC1D8 | 0.0969 | 0.7287 | #39B600 | 0.7591 | None |
| EGR1 | 0.2192 | -0.0094 | #00A7FF | 0.7585 | None |
| KCNT2 | 0.0170 | -0.0094 | #00A7FF | 0.7569 | None |
| GARNL3 | -0.0933 | -0.5355 | #FD61D1 | 0.7562 | None |
| SERPING1 | 0.2991 | 0.8742 | #E58700 | 0.7550 | None |
| ASAP1 | -0.0896 | -0.6031 | #00BA38 | 0.7542 | None |
| CREB3L1 | -0.0234 | -0.0745 | #6BB100 | 0.7542 | None |
| HECTD2 | 0.1862 | -0.0094 | #00A7FF | 0.7535 | None |
| TADA2A | -0.0020 | 0.7916 | #00C097 | 0.7532 | None |
| ITGAX | -0.0073 | -0.0094 | #00A7FF | 0.7532 | None |
| SLITRK1 | -0.0644 | -0.9514 | #00C0AF | 0.7512 | None |
| IKBKAP | -0.0036 | -0.5355 | #FD61D1 | 0.7507 | None |
| ACSM5 | 0.1248 | -0.0094 | #00A7FF | 0.7507 | None |
| CPXM2 | 0.4603 | 0.8742 | #E58700 | 0.7503 | None |
| AQP1 | -0.0924 | -0.9514 | #00C0AF | 0.7478 | None |
| PLD4 | 0.1010 | 0.7287 | #39B600 | 0.7478 | None |
| EDC4 | 0.0696 | -0.2161 | #F8766D | 0.7467 | None |
| ARHGEF3 | -0.0487 | -0.9514 | #00C0AF | 0.7453 | None |
| KCNA6 | -0.1125 | -0.9514 | #00C0AF | 0.7449 | None |
| PTPRO | 0.1187 | -0.0094 | #00A7FF | 0.7444 | None |
| PLEKHA5 | -0.0061 | -0.9514 | #00C0AF | 0.7443 | None |
| ARC | 0.2517 | -0.0094 | #00A7FF | 0.7440 | None |
| EMD | -0.0474 | -0.9514 | #00C0AF | 0.7434 | None |
| DIP2A | 0.0170 | -0.0094 | #00A7FF | 0.7425 | 3 |
| PSME4 | 0.2145 | -0.0094 | #00A7FF | 0.7416 | None |
| PDE7A | 0.1339 | 0.7916 | #00C097 | 0.7409 | None |
| MOG | -0.0290 | -0.9514 | #00C0AF | 0.7406 | None |
| EVL | -0.1527 | -0.9514 | #00C0AF | 0.7405 | None |
| C3 | 0.3945 | 0.7287 | #39B600 | 0.7404 | None |
| CCRN4L | -0.0285 | -0.0094 | #00A7FF | 0.7394 | None |
| LEPREL1 | 0.2452 | 0.7916 | #00C097 | 0.7393 | None |
| LRRC7 | 0.1862 | -0.9514 | #00C0AF | 0.7384 | None |
| CDH18 | 0.1860 | -0.9514 | #00C0AF | 0.7377 | None |
negative_set = test_set %>% filter(!SFARI)
negative_set %>% mutate(diff = abs(prob-corrected_score)) %>%
ggplot(aes(prob, corrected_score, color = diff)) + geom_point(alpha=0.2) + scale_color_viridis() +
geom_abline(slope=1, intercept=0, color='gray', linetype='dashed') +
geom_smooth(color='#666666', alpha=0.5, se=TRUE, size=0.5) + coord_fixed() +
xlab('Original probability') + ylab('Corrected probability') + theme_minimal() + theme(legend.position = 'none')
negative_set_table = negative_set %>% mutate(corrected_pred = corrected_score>0.5) %>%
apply_labels(corrected_score = 'Corrected Probability',
corrected_pred = 'Corrected Class Prediction',
pred = 'Original Class Prediction')
cro(negative_set_table$pred, list(negative_set_table$corrected_pred, total()))
| Original Class Prediction | Corrected Class Prediction: FALSE | Corrected Class Prediction: TRUE | #Total |
|---|---|---|---|
| FALSE | 5813 | 2218 | 8031 |
| TRUE | 1382 | 2844 | 4226 |
| #Total cases | 7195 | 5062 | 12257 |
cat(paste0('\n', round(100*mean(negative_set_table$corrected_pred == negative_set_table$pred)),
'% of the genes maintained their original predicted class'))
##
## 71% of the genes maintained their original predicted class
rm(negative_set_table)
*The transparent version of the trend line is the original trend line
The relation is the opposite of before: the higher the Gene Significance, the higher the score, with the lowest scores corresponding to under-expressed genes
negative_set %>% ggplot(aes(corrected_score, GS, color=MTcor)) + geom_point() + geom_smooth(method='gam', color='#666666') +
geom_line(stat='smooth', method='gam', color='#666666', alpha=0.5, size=1.2, aes(x=prob)) +
geom_hline(yintercept=mean(negative_set$GS), color='gray', linetype='dashed') +
scale_color_gradientn(colours=c('#F8766D','white','#00BFC4')) + xlab('Corrected Score') +
ggtitle('Relation between the Model\'s Corrected Score and Gene Significance') + theme_minimal()
Summarised version of score vs mean expression, plotting by module instead of by gene
The difference between the trend lines in this plot and the one above is that the one above takes all the genes into consideration, while this one treats each module as a single observation; the top one is therefore strongly affected by big modules, while the bottom one treats all modules the same
The transparent versions of the points and trend lines are the original values and trends before the bias correction
plot_data = negative_set %>% group_by(MTcor) %>% summarise(mean = mean(prob), sd = sd(prob),new_mean = mean(corrected_score),
new_sd = sd(corrected_score), n = n()) %>%
mutate(MTcor_sign = ifelse(MTcor>0, 'Positive', 'Negative')) %>% left_join(original_dataset, by='MTcor') %>%
dplyr::select(Module, MTcor, MTcor_sign, mean, new_mean, sd, new_sd, n) %>% distinct()
colnames(plot_data)[1] = 'ID'
ggplotly(plot_data %>% ggplot(aes(MTcor, new_mean, size=n, color=MTcor_sign)) + geom_point(aes(id = ID)) +
geom_smooth(method='loess', color='gray', se=FALSE) + geom_smooth(method='lm', se=FALSE) +
geom_point(aes(y=mean), alpha=0.3) + xlab('Module-Diagnosis correlation') + ylab('Mean Corrected Score by the Model') +
geom_line(stat='smooth', method='loess', color='gray', se=FALSE, alpha=0.3, size=1.2, aes(y=mean)) +
geom_line(stat='smooth', method='lm', se=FALSE, alpha=0.3, size=1.2, aes(y=mean)) +
theme_minimal() + theme(legend.position='none'))
Check if correcting by gene also corrected by module: Yes, the bias seems to be removed completely; it may even be a bit negative now
mean_and_sd = data.frame(ID=rownames(datExpr), meanExpr=rowMeans(datExpr), sdExpr=apply(datExpr,1,sd))
plot_data = negative_set %>% mutate(ID=rownames(test_set)[!test_set$SFARI]) %>% left_join(mean_and_sd, by='ID') %>%
left_join(original_dataset %>% mutate(ID=rownames(original_dataset)) %>%
dplyr::select(ID, Module), by='ID')
plot_data2 = plot_data %>% group_by(Module) %>% summarise(meanExpr = mean(meanExpr), meanProb = mean(prob),
new_meanProb = mean(corrected_score), n=n())
ggplotly(plot_data2 %>% ggplot(aes(meanExpr, new_meanProb, size=n)) +
geom_point(color=plot_data2$Module) + geom_point(color=plot_data2$Module, alpha=0.3, aes(y=meanProb)) +
geom_smooth(method='loess', se=TRUE, color='gray', alpha=0.1, size=0.7) +
geom_line(stat='smooth', method='loess', se=TRUE, color='gray', alpha=0.4, size=1.2, aes(y=meanProb)) +
theme_minimal() + theme(legend.position='none') + xlab('Mean Expression') + ylab('Corrected Probability') +
ggtitle('Mean expression vs corrected Model score by Module'))
rm(plot_data2, mean_and_sd)
The relation between SD and score is now stronger than before
plot_data %>% ggplot(aes(sdExpr, corrected_score)) + geom_point(alpha=0.1, color='#0099cc') +
geom_smooth(method='lm', color='#999999', se=FALSE, alpha=1) + xlab('SD') + ylab('Corrected Probability') +
geom_line(stat='smooth', method='lm', color='#999999', se=FALSE, alpha=0.4, size=1.5, aes(y=prob)) +
theme_minimal() + ggtitle('SD vs model probability by gene') + scale_x_sqrt()
For under-expressed genes, the relation between LFC and probability got inverted. The difference is quite big
For over-expressed genes, the trend didn't change; it just got translated upwards. Now, in general, over-expressed genes have higher probabilities than under-expressed genes
plot_data = negative_set %>% mutate(ID=rownames(test_set)[!test_set$SFARI]) %>%
left_join(DE_info %>% mutate(ID=rownames(DE_info)), by='ID')
plot_data %>% ggplot(aes(log2FoldChange, corrected_score)) + geom_point(alpha=0.1, color='#0099cc') +
geom_smooth(method='loess', color='gray', alpha=0.1) + xlab('LFC') + ylab('Corrected Probability') +
geom_line(stat='smooth', method='loess', color='gray', alpha=0.4, size=1.5, aes(y=prob)) +
theme_minimal() + ggtitle('LFC vs model probability by gene')
The trends for DE genes don't seem to have changed much; they were just translated, upwards for over-expressed genes and downwards for under-expressed genes
The big difference we see in the plot above is just the connection between the patterns for under- and over-expressed genes
The only group that doesn't seem to be affected much is the non-DE group of under-expressed genes
p1 = plot_data %>% filter(log2FoldChange<0) %>% mutate(DE = padj<0.05) %>% ggplot(aes(log2FoldChange, corrected_score, color=DE)) + geom_point(alpha=0.1) +
geom_smooth(method='loess', alpha=0.1) + xlab('') + ylab('Corrected Probability') +
ylim(c(min(plot_data$corrected_score), max(plot_data$corrected_score))) +
geom_line(stat='smooth', method='loess', alpha=0.4, size=1.5, aes(y=prob, color = DE)) +
theme_minimal() + theme(legend.position = 'none', plot.margin=unit(c(1,-0.3,1,1), 'cm'))
p2 = plot_data %>% filter(log2FoldChange>=0) %>% mutate(DE = padj<0.05) %>% ggplot(aes(log2FoldChange, corrected_score, color=DE)) + geom_point(alpha=0.1) +
geom_smooth(method='loess', alpha=0.1) + xlab('') + ylab('Corrected Probability') + ylab('') +
scale_y_continuous(position = 'right', limits = c(min(plot_data$corrected_score), max(plot_data$corrected_score))) +
geom_line(stat='smooth', method = 'loess', alpha=0.4, size=1.5, aes(y = prob, color = DE)) +
theme_minimal() + theme(plot.margin = unit(c(1,1,1,-0.3), 'cm'), axis.ticks.y = element_blank())
grid.arrange(p1, p2, nrow=1, top = 'LFC vs model probability by gene', bottom = 'LFC')
rm(p1, p2)
The scores decreased for modules with negative correlation and increased for modules with positive correlation
module_score = negative_set %>% mutate(ID=rownames(test_set)[!test_set$SFARI]) %>%
left_join(original_dataset %>% mutate(ID = rownames(original_dataset)), by='ID') %>%
dplyr::select(ID, prob, corrected_score, Module, MTcor.x) %>% rename(MTcor = MTcor.x) %>%
left_join(data.frame(MTcor=unique(dataset$MTcor)) %>% arrange(by=MTcor) %>%
mutate(order=1:length(unique(dataset$MTcor))), by='MTcor')
ggplotly(module_score %>% ggplot(aes(MTcor, corrected_score)) + geom_point(color=module_score$Module, aes(id=ID, alpha=corrected_score^4)) +
geom_hline(yintercept=mean(module_score$corrected_score), color='gray', linetype='dotted') +
geom_line(stat='smooth', method = 'loess', color='gray', alpha=0.5, size=1.5, aes(x=MTcor, y=prob)) +
geom_smooth(color='gray', method = 'loess', se = FALSE, alpha=0.3) + theme_minimal() +
xlab('Module-Diagnosis correlation') + ylab('Corrected Score'))
This bias correction makes bigger changes to the distribution of the probabilities than the post-processing approach. Its main effect seems to be to reduce the importance of under-expressed genes and increase the importance of over-expressed genes
write.csv(dataset, file='./../Data/BC_weighting_approach.csv', row.names = TRUE)
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] expss_0.10.2 polycor_0.7-10 car_3.0-7 carData_3.0-3
## [5] ROCR_1.0-7 gplots_3.0.3 caret_6.0-86 lattice_0.20-41
## [9] Rtsne_0.15 biomaRt_2.40.5 reshape2_1.4.3 corrplot_0.84
## [13] RColorBrewer_1.1-2 gridExtra_2.3 viridis_0.5.1 viridisLite_0.3.0
## [17] plotly_4.9.2 knitr_1.28 forcats_0.5.0 stringr_1.4.0
## [21] dplyr_0.8.5 purrr_0.3.3 readr_1.3.1 tidyr_1.0.2
## [25] tibble_3.0.0 ggplot2_3.3.0 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.5
## [3] Hmisc_4.4-0 plyr_1.8.6
## [5] lazyeval_0.2.2 splines_3.6.3
## [7] crosstalk_1.1.0.1 BiocParallel_1.18.1
## [9] GenomeInfoDb_1.20.0 digest_0.6.25
## [11] foreach_1.5.0 htmltools_0.4.0
## [13] gdata_2.18.0 fansi_0.4.1
## [15] magrittr_1.5 checkmate_2.0.0
## [17] memoise_1.1.0 cluster_2.1.0
## [19] openxlsx_4.1.4 annotate_1.62.0
## [21] recipes_0.1.10 modelr_0.1.6
## [23] gower_0.2.1 matrixStats_0.56.0
## [25] prettyunits_1.1.1 jpeg_0.1-8.1
## [27] colorspace_1.4-1 blob_1.2.1
## [29] rvest_0.3.5 haven_2.2.0
## [31] xfun_0.12 crayon_1.3.4
## [33] RCurl_1.98-1.1 jsonlite_1.6.1
## [35] genefilter_1.66.0 survival_3.1-11
## [37] iterators_1.0.12 glue_1.3.2
## [39] gtable_0.3.0 ipred_0.9-9
## [41] zlibbioc_1.30.0 XVector_0.24.0
## [43] DelayedArray_0.10.0 shape_1.4.4
## [45] BiocGenerics_0.30.0 abind_1.4-5
## [47] scales_1.1.0 DBI_1.1.0
## [49] Rcpp_1.0.4 xtable_1.8-4
## [51] progress_1.2.2 htmlTable_1.13.3
## [53] foreign_0.8-75 bit_1.1-15.2
## [55] Formula_1.2-3 stats4_3.6.3
## [57] lava_1.6.7 prodlim_2019.11.13
## [59] glmnet_3.0-2 htmlwidgets_1.5.1
## [61] httr_1.4.1 acepack_1.4.1
## [63] ellipsis_0.3.0 farver_2.0.3
## [65] pkgconfig_2.0.3 XML_3.99-0.3
## [67] nnet_7.3-13 dbplyr_1.4.2
## [69] locfit_1.5-9.4 labeling_0.3
## [71] tidyselect_1.0.0 rlang_0.4.5
## [73] AnnotationDbi_1.46.1 munsell_0.5.0
## [75] cellranger_1.1.0 tools_3.6.3
## [77] cli_2.0.2 generics_0.0.2
## [79] RSQLite_2.2.0 broom_0.5.5
## [81] evaluate_0.14 yaml_2.2.1
## [83] ModelMetrics_1.2.2.2 bit64_0.9-7
## [85] fs_1.4.0 zip_2.0.4
## [87] caTools_1.18.0 nlme_3.1-147
## [89] xml2_1.2.5 compiler_3.6.3
## [91] rstudioapi_0.11 png_0.1-7
## [93] curl_4.3 e1071_1.7-3
## [95] reprex_0.3.0 geneplotter_1.62.0
## [97] stringi_1.4.6 highr_0.8
## [99] Matrix_1.2-18 vctrs_0.2.4
## [101] pillar_1.4.3 lifecycle_0.2.0
## [103] data.table_1.12.8 bitops_1.0-6
## [105] GenomicRanges_1.36.1 latticeExtra_0.6-29
## [107] R6_2.4.1 KernSmooth_2.23-16
## [109] rio_0.5.16 IRanges_2.18.3
## [111] codetools_0.2-16 MASS_7.3-51.5
## [113] gtools_3.8.2 assertthat_0.2.1
## [115] SummarizedExperiment_1.14.1 DESeq2_1.24.0
## [117] withr_2.1.2 S4Vectors_0.22.1
## [119] GenomeInfoDbData_1.2.1 mgcv_1.8-31
## [121] parallel_3.6.3 hms_0.5.3
## [123] grid_3.6.3 rpart_4.1-15
## [125] timeDate_3043.102 class_7.3-16
## [127] rmarkdown_2.1 pROC_1.16.2
## [129] base64enc_0.1-3 Biobase_2.44.0
## [131] lubridate_1.7.4